Features based text similarity detection

Author: Kok Kent Chow
Salim Naomie
Publication venue: 'Estonian Academy Publishers'
Publication date: 01/01/2010

As the Internet helps us cross cultural borders by providing access to diverse information, plagiarism is bound to arise, and plagiarism detection is increasingly in demand. Different plagiarism detection tools have been developed based on various detection techniques. Fingerprint matching now plays an important role in these tools. However, for large articles, fingerprint matching has weaknesses, especially in space and time consumption. In this paper, we propose a new approach to plagiarism detection that integrates fingerprint matching with four key features to assist the detection process. The proposed features are able to select the main points or key sentences in the articles to be compared. The selected sentences then undergo fingerprint matching to detect the similarity between them. Hence, the time and space used for the comparison are reduced without affecting the effectiveness of the plagiarism detection.
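
The idea in this abstract, selecting key sentences first and fingerprinting only those, can be illustrated with a minimal sketch. The abstract does not name the four features, so `select_key_sentences` below uses a hypothetical stand-in (summed word frequency), and the fingerprint is a plain set of hashed word k-grams compared with Jaccard overlap. All function names and parameters are assumptions for illustration, not the authors' method.

```python
import hashlib
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on '.', '!' and '?'; a real system would use a proper tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def select_key_sentences(text, top_n=2):
    # Hypothetical stand-in for the paper's four features: rank each sentence
    # by the summed document frequency of its words and keep the top_n.
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in words(s))
    ranked = sorted(sentences, key=lambda s: sum(freq[w] for w in words(s)), reverse=True)
    return ranked[:top_n]


def fingerprint(sentence, k=3):
    # Fingerprint = set of hashed word k-grams (the whole sentence if shorter than k).
    ws = words(sentence)
    if not ws:
        return set()
    grams = [" ".join(ws[i:i + k]) for i in range(max(len(ws) - k + 1, 1))]
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}


def jaccard(a, b):
    # Overlap of two fingerprint sets; 0.0 when both are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(doc_a, doc_b, top_n=2, k=3):
    # Best fingerprint overlap between any pair of selected sentences.
    fa = [fingerprint(s, k) for s in select_key_sentences(doc_a, top_n)]
    fb = [fingerprint(s, k) for s in select_key_sentences(doc_b, top_n)]
    return max((jaccard(x, y) for x in fa for y in fb), default=0.0)
```

Because only `top_n` sentences per document are hashed and compared, the number of fingerprint comparisons stays fixed regardless of document length, which is the kind of saving the abstract describes.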

arXiv.org e-Print Archive

Universiti Teknologi Malaysia Institutional Repository

Research Collaboration Analysis Using Text and Graph Features

Author: Herrmannova Drahomira
Knoth Petr
Patton Robert
Stahl Christopher
Wells Jack
Publication date: 01/01/2018

Patterns of scientific collaboration and their effect on scientific production have been the subject of many studies. In this paper we analyze the nature of ties between co-authors and study collaboration patterns in science from two perspectives: the semantic similarity of authors who wrote a paper together, and the strength of ties between these authors (i.e. how much they have previously collaborated). These two views of scientific collaboration are used to analyze publications in the TrueImpactDataset [11], a new dataset containing two types of publications: those regarded as seminal and those regarded as literature reviews by field experts. We show there are distinct differences between seminal publications and literature reviews in terms of author similarity and the strength of ties between their authors. In particular, we find that seminal publications tend to be written by authors who have previously worked on dissimilar problems (i.e. authors from different fields or even disciplines), and by authors who are not frequent collaborators. On the other hand, literature reviews in our dataset tend to be the result of an established collaboration within a discipline. This demonstrates that our method provides meaningful information about the potential future impact of a publication without requiring citation information.
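
The two views named in this abstract can be made concrete with a small sketch. The abstract does not specify how similarity or tie strength are computed, so the code assumes the simplest plausible choices: cosine similarity between bag-of-words vectors built from each author's earlier texts, and tie strength as the count of earlier co-authored papers. The function names and input shapes are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(texts):
    # Bag-of-words vector over an author's earlier titles or abstracts.
    return Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))


def cosine(a, b):
    # Cosine similarity of two sparse term vectors; 0.0 if either is empty.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def tie_strength(author_a, author_b, prior_papers):
    # Assumed definition: number of earlier papers listing both authors.
    # prior_papers is an iterable of author-name sets, one per paper.
    return sum(1 for authors in prior_papers if author_a in authors and author_b in authors)
```

Under these assumptions, a paper whose co-authors have low cosine similarity and low tie strength would match the profile the abstract reports for seminal publications.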

Open Research Online (The Open University)

Non-Standard Words as Features for Text Categorization

Author: Beliga Slobodan
Martinčić-Ipšić Sanda
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/11/2014

This paper presents categorization of Croatian texts using Non-Standard Words (NSW) as features. Non-Standard Words are numbers, dates, acronyms, abbreviations, currency, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected and formed the SKIPEZ collection with 6 classes: official, literary, informative, popular, educational and scientific. The text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second representation, the statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; while the third representation combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in the text categorization experiments. The best categorization results are achieved using the first feature set (NSW frequencies), with a categorization accuracy of 87%. This suggests that NSWs should be considered as features in highly inflectional languages, such as Croatian. NSW-based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs should be considered for further Croatian text categorization experiments. Comment: IEEE 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419, 201
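
The first representation described above (NSW frequencies as features) can be sketched as a small feature extractor. The regular expressions below are illustrative assumptions covering four of the listed NSW types; they are not the Croatian NSW taxonomy used in the paper, and the category names and ordering are hypothetical.

```python
import re

# Illustrative patterns only. Order matters: more specific categories are
# matched first and their spans removed, so a date is not recounted as numbers.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    "currency": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:kn|EUR|USD)\b"),
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),
    "number": re.compile(r"\b\d+(?:[.,]\d+)?\b"),
}


def nsw_counts(text):
    # Count matches per category, blanking each match before the next pattern runs.
    counts = {}
    remaining = text
    for name, pattern in NSW_PATTERNS.items():
        counts[name] = len(pattern.findall(remaining))
        remaining = pattern.sub(" ", remaining)
    return counts


def nsw_features(text):
    # Fixed-order frequency vector, usable as input to any standard classifier.
    counts = nsw_counts(text)
    return [counts[name] for name in NSW_PATTERNS]
```

The resulting vector has one dimension per NSW category rather than one per word form, which illustrates why such features keep the feature space small without lemmatization.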

arXiv.org e-Print Archive

Crossref

TRECVid 2006 experiments at Dublin City University

Author: Adamek Tomasz
Koskela Markus
O'Connor Noel E.
Smeaton Alan F.
Wilkins Peter
Publication venue: 'University of Aden - Faculty of Economics and Administration'
Publication date: 01/01/2006

In this paper we describe our retrieval system and experiments performed for the automatic search task in TRECVid 2006. We submitted the following six automatic runs:
• F A 1 DCU-Base 6: Baseline run using only ASR/MT text features.
• F A 2 DCU-TextVisual 2: Run using text and visual features.
• F A 2 DCU-TextVisMotion 5: Run using text, visual, and motion features.
• F B 2 DCU-Visual-LSCOM 3: Text and visual features combined with concept detectors.
• F B 2 DCU-LSCOM-Filters 4: Text, visual, and motion features with concept detectors.
• F B 2 DCU-LSCOM-2 1: Text, visual, motion, and concept detectors with negative concepts.
The experiments were designed both to study the effect of adding motion features and separately constructed models for semantic concepts to runs using only textual and visual features, and to establish a baseline for the manually-assisted search runs performed within the collaborative K-Space project and described in the corresponding TRECVid 2006 notebook paper. The results of the experiments indicate that the performance of automatic search can be improved with suitable concept models. This, however, is very topic-dependent, and the questions of when to include such models and which concept models should be included remain unanswered. Secondly, using motion features did not lead to performance improvement in our experiments. Finally, it was observed that our text features, despite displaying rather poor performance overall, may still be useful even for generic search topics.

CiteSeerX

Irish Universities

DCU Online Research Access Service

Strong correlations between text quality and complex networks features

Author: L. Antiqueira
L. da F. Costa
M.G.V. Nunes
O.N. Oliveira Jr.
Publication venue: 'Elsevier BV'
Publication date: 16/01/2006

Concepts of complex networks have been used to obtain metrics that were correlated to text quality established by scores assigned by human judges. Texts produced by high-school students in Portuguese were represented as scale-free networks (word adjacency model), from which typical network features such as the in/outdegree, clustering coefficient and shortest path were obtained. Another metric was derived from the dynamics of the network growth, based on the variation of the number of connected components. The scores assigned by the human judges according to three text quality criteria (coherence and cohesion, adherence to standard writing conventions and theme adequacy/development) were correlated with the network measurements. Text quality for all three criteria was found to decrease with increasing average values of outdegrees, clustering coefficient and deviation from the dynamics of network growth. Among the criteria employed, cohesion and coherence showed the strongest correlation, which probably indicates that the network measurements are able to capture how the text is developed in terms of the concepts represented by the nodes in the networks. Though based on a particular set of texts and specific language, the results presented here point to potential applications in other instances of text analysis. Comment: 8 pages, 8 figure
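
The word adjacency model and two of the network features mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions: no stopword removal or lemmatization is applied, edges are unweighted, and the clustering coefficient is computed on the undirected version of the graph. The abstract does not give the exact definitions used in the paper, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def adjacency_network(text):
    # Directed edge u -> v whenever word v immediately follows word u.
    tokens = re.findall(r"[a-z]+", text.lower())
    succ = defaultdict(set)
    for t in tokens:
        succ[t]  # make sure every word is a node, even with no outgoing edge
    for u, v in zip(tokens, tokens[1:]):
        if u != v:
            succ[u].add(v)
    return dict(succ)


def average_outdegree(succ):
    # Mean number of distinct successor words per node.
    return sum(len(vs) for vs in succ.values()) / len(succ) if succ else 0.0


def average_clustering(succ):
    # Simplifying assumption: edge direction is ignored for clustering.
    nbrs = {u: set() for u in succ}
    for u, vs in succ.items():
        for v in vs:
            nbrs[u].add(v)
            nbrs[v].add(u)
    total = 0.0
    for u, ns in nbrs.items():
        k = len(ns)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        ordered = sorted(ns)
        links = sum(1 for i, a in enumerate(ordered) for b in ordered[i + 1:] if b in nbrs[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nbrs) if nbrs else 0.0
```

Per-text averages like these are the kind of measurement that could then be correlated with the judges' scores, for example with a rank correlation over the set of essays.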

arXiv.org e-Print Archive

Crossref